SEGMENTAL TONAL MODELING FOR TONAL LANGUAGES

BACKGROUND OF THE INVENTION 
The present invention relates generally to 
the field of speech processing systems such as speech 
recognizers and text-to-speech converters. More
specifically, the present invention relates to 
modeling units or set design, used in such systems. 

Selecting the most suitable units, i.e.
modeling units, to represent salient acoustic and
phonetic information for a language is an important
issue in designing a workable speech processing
system such as a speech recognizer or text-to-speech
converter. Some important criteria for selecting the
appropriate modeling units include how accurately the
modeling units can represent words, particularly in
different word contexts; how trainable the
resulting model is and whether the parameters of the
units can be estimated reliably with enough data; and
whether new words can be easily derived from the
predefined unit inventory, i.e., whether the resulting
model is generalizable.

Besides the overall factors to consider as
provided above, there are several layers of units to
be considered: phones, syllables and words. Their
performance in terms of the above criteria is very
different. Word-based units should be a good choice
for domain-specific systems, such as a speech
recognizer designed for digits. However, for a large
vocabulary continuous speech recognizer (LVCSR),
phone-based units are better since they are more
trainable and generalizable.

Many speech processing systems now use
context-dependent phones, such as triphones, in the
context of a state-sharing technology, e.g. Hidden
Markov Modeling. The resulting systems have yielded
good performance, particularly for western languages
such as English. This is due in part to the smaller
phone set of the western languages (e.g. English
comprises only about 50 phones). When modeled as
context-dependent phones such as triphones, such a
phone set would theoretically entail 50^3 different
triphones; in practice, such systems use fewer and are
considered both trainable and generalizable.

Although phone-based systems, such as
systems based on Hidden Markov Modeling of triphones,
have been shown to work well with western languages
like English, speech processing systems for tonal
languages like Chinese have generally used syllables
as the basis of the modeled unit. Compared with most
western languages, a tonal language such as Mandarin
Chinese has several distinctive characteristics.
First, the number of words is unlimited, while the
number of characters and syllables is fixed.
Specifically, one Chinese character corresponds to
one syllable. In total, there are about 420 base
syllables and more than 1200 tonal ones.

Since Chinese is a tonal language, each
syllable usually has five tone types, from tone
1 to tone 5, such as {/ma1/ /ma2/ /ma3/ /ma4/ /ma5/}.
Among the five tones, the first four are normal tones,
which have the shapes High Level, Rising, Low Level
and Falling. The fifth tone is a neutralization of
the other four. Although the phones are the same, the
actual acoustic realizations are different because of
the different tone types.

In addition to the one-to-one mapping between
character and syllable, a defined structure exists
inside the syllable. Specifically, each base syllable
can be represented with the following form:

(C) + (G) V (V, N)

According to Chinese phonology, the first part before
"+" is called the initial, which mainly consists of
consonants. There are 22 initials in Chinese, one
of which is a zero initial, representing the cases when
the initial is absent. The parts after "+" are called
finals. There are about 38 finals in Mandarin
Chinese. Here (G), V and (V, N) are called the head
(glide), body (main vowel) and tail (coda) of the
final, respectively. Units in brackets are optional in
constructing valid syllables.
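
As a rough illustration of the initial/final split described
above, the following Python sketch separates a toneless Pinyin
syllable into its initial and final. The helper name, the use of
Pinyin spelling, and the handling of the zero initial as an empty
string are assumptions made for this example only; the list holds
the 21 standard consonant initials, which together with the zero
initial gives the 22 initials mentioned above.

```python
# Hypothetical helper, not taken from the specification.
# 21 consonant initials in Pinyin spelling; the zero initial is "".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]


def split_initial_final(syllable):
    """Return (initial, final); initial is "" for the zero initial."""
    for initial in INITIALS:  # two-letter initials are listed first
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return initial, syllable[len(initial):]
    return "", syllable


print(split_initial_final("zhuang"))  # ('zh', 'uang')
print(split_initial_final("ma"))      # ('m', 'a')
print(split_initial_final("an"))      # ('', 'an')
```

The sketch works on spelling only. Syllables written with "y" or
"w" are treated as zero-initial syllables and are returned
unchanged.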

As mentioned above, syllables have
generally formed the basis of the modeled unit in a
tonal language such as Mandarin Chinese. Such a
system has generally not been used for western
languages because thousands of possible syllables
exist. However, such a representation is very accurate
for Mandarin Chinese and the number of units is also
acceptable. The number of tri-syllables, however, is
very large and tonal syllables make the situation
even worse. Therefore, most of the current modeling
strategies for Mandarin Chinese are based on the
decomposition of the syllable. In these strategies,
syllables are usually decomposed into initial and
final parts, while tone information is modeled
separately or together with the final parts.
Nevertheless, shortcomings still exist with these
systems and an improved modeling unit set is certainly
desired.
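
To make the growth in inventory size concrete, the following
Python sketch computes the theoretical number of three-unit
contexts for the approximate unit counts quoted above (about 50
English phones, about 420 base syllables and about 1200 tonal
syllables). The function name is hypothetical, and the figures are
upper bounds on paper, not counts of units that occur in practice.

```python
def tri_unit_bound(n_units):
    """Upper bound on left-unit-right contexts for n_units base units."""
    return n_units ** 3


for name, n in [("English phones", 50),
                ("Mandarin base syllables", 420),
                ("Mandarin tonal syllables", 1200)]:
    print(f"{name}: {n} units -> {tri_unit_bound(n):,} tri-units")
```

Under these assumptions the bound is 125,000 for triphones,
74,088,000 for base tri-syllables and 1,728,000,000 for tonal
tri-syllables.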

SUMMARY OF THE INVENTION

A phone set for use in speech processing
such as speech recognition or text-to-speech
conversion is used to model or form syllables of a
tonal language having a plurality of different tones.
In one embodiment, each syllable includes an initial
part that can be glide dependent and a final part.
The final part includes a plurality of segments or
phones. Each segment carries categorical tonal
information such that the segments taken together
implicitly and jointly represent the different tones.
Since a tone contains two segments, one phone
carries only part of the tone information and the two
phones in a final part work together to represent the
whole tone information. Stated yet another way, a
first set of the plurality of phones is used to
describe the initials, while a second set is used to
describe the finals.

Embodied as either a speech processing
system or a method of speech processing, the phone
set is accessed and utilized to identify syllables in
an input for performing one of speech recognition and
text-to-speech conversion. An output is then provided
corresponding to one of speech recognition and
text-to-speech conversion.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a general 
computing environment in which the present invention 
can be useful.

FIG. 2 is a block diagram of a speech 
processing system.

FIG. 3 is a block diagram of a text-to- 
speech converter. 

FIG. 4 is a block diagram of a speech
recognition system.

FIG. 5 is a graph illustrating tone types in
Mandarin Chinese.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS 

Prior to discussing the present invention 
in greater detail, an embodiment of an illustrative 
environment in which the present invention can be
used will be discussed. FIG. 1 illustrates an example 
of a suitable computing system environment 100 on 
which the invention may be implemented. The 
computing system environment 100 is only one example 
of a suitable computing environment and is not
intended to suggest any limitation as to the scope of 
use or functionality of the invention. Neither should 
the computing environment 100 be interpreted as 
having any dependency or requirement relating to any
one or combination of components illustrated in the
exemplary operating environment 100. 

The invention is operational with numerous 
other general purpose or special purpose computing 
system environments or configurations. Examples of
well known computing systems, environments, and/or 
configurations that may be suitable for use with the 
invention include, but are not limited to, personal 
computers, server computers, hand-held or laptop
devices, multiprocessor systems, microprocessor-based
systems, set top boxes, programmable consumer 
electronics, network PCs, minicomputers, mainframe 
computers, distributed computing environments that 
include any of the above systems or devices, and the
like.

The invention may be described in the 
general context of computer-executable instructions, 
such as program modules, being executed by a 
computer. Generally, program modules include
routines, programs, objects, components, data
structures, etc. that perform particular tasks or 
implement particular abstract data types. Those 
skilled in the art can implement the description 
and/or figures herein as computer-executable
instructions, which can be embodied on any form of
computer readable media discussed below. 

The invention may also be practiced in 
distributed computing environments where tasks are 
performed by remote processing devices that are
linked through a communications network. In a
distributed computing environment, program modules
may be located in both local and remote computer
storage media including memory storage devices.

With reference to FIG. 1, an exemplary
system for implementing the invention includes a
general purpose computing device in the form of a
computer 110. Components of computer 110 may
include, but are not limited to, a processing unit
120, a system memory 130, and a system bus 121 that
couples various system components including the
system memory to the processing unit 120. The system
bus 121 may be any of several types of bus structures
including a memory bus or memory controller, a
peripheral bus, and a local bus using any of a
variety of bus architectures. By way of example, and
not limitation, such architectures include Industry
Standard Architecture (ISA) bus, Micro Channel
Architecture (MCA) bus, Enhanced ISA (EISA) bus,
Video Electronics Standards Association (VESA) local
bus, and Peripheral Component Interconnect (PCI) bus
also known as Mezzanine bus.

Computer 110 typically includes a variety
of computer readable media. Computer readable media
can be any available media that can be accessed by
computer 110 and includes both volatile and
nonvolatile media, removable and non-removable media.
By way of example, and not limitation, computer
readable media may comprise computer storage media
and communication media. Computer storage media
includes both volatile and nonvolatile, removable and
non-removable media implemented in any method or
technology for storage of information such as
computer readable instructions, data structures,
program modules or other data. Computer storage
media includes, but is not limited to, RAM, ROM,
EEPROM, flash memory or other memory technology,
CD-ROM, digital versatile disks (DVD) or other optical
disk storage, magnetic cassettes, magnetic tape,
magnetic disk storage or other magnetic storage
devices, or any other medium which can be used to
store the desired information and which can be
accessed by computer 110. Communication media
typically embodies computer readable instructions,
data structures, program modules or other data in a
modulated data signal such as a carrier wave or other
transport mechanism and includes any information
delivery media. The term "modulated data signal"
means a signal that has one or more of its
characteristics set or changed in such a manner as to
encode information in the signal. By way of example,
and not limitation, communication media includes
wired media such as a wired network or direct-wired
connection, and wireless media such as acoustic, RF,
infrared and other wireless media. Combinations of
any of the above should also be included within the
scope of computer readable media.

The system memory 130 includes computer
storage media in the form of volatile and/or
nonvolatile memory such as read only memory (ROM) 131
and random access memory (RAM) 132. A basic
input/output system 133 (BIOS), containing the basic
routines that help to transfer information between
elements within computer 110, such as during
start-up, is typically stored in ROM 131. RAM 132
typically contains data and/or program modules that are
immediately accessible to and/or presently being
operated on by processing unit 120. By way of
example, and not limitation, FIG. 1 illustrates
operating system 134, application programs 135, other
program modules 136, and program data 137.

The computer 110 may also include other
removable/non-removable volatile/nonvolatile computer
storage media. By way of example only, FIG. 1
illustrates a hard disk drive 141 that reads from or
writes to non-removable, nonvolatile magnetic media,
a magnetic disk drive 151 that reads from or writes
to a removable, nonvolatile magnetic disk 152, and an
optical disk drive 155 that reads from or writes to a
removable, nonvolatile optical disk 156 such as a CD
ROM or other optical media. Other
removable/non-removable, volatile/nonvolatile computer
storage media that can be used in the exemplary
operating environment include, but are not limited to,
magnetic tape cassettes, flash memory cards, digital
versatile disks, digital video tape, solid state RAM,
solid state ROM, and the like. The hard disk drive 141 is
typically connected to the system bus 121 through a
non-removable memory interface such as interface 140,
and magnetic disk drive 151 and optical disk drive
155 are typically connected to the system bus 121 by
a removable memory interface, such as interface 150.

The drives and their associated computer
storage media discussed above and illustrated in FIG.
1 provide storage of computer readable instructions,
data structures, program modules and other data for
the computer 110. In FIG. 1, for example, hard disk
drive 141 is illustrated as storing operating system
144, application programs 145, other program modules
146, and program data 147. Note that these components
can either be the same as or different from operating
system 134, application programs 135, other program
modules 136, and program data 137. Operating system
144, application programs 145, other program modules
146, and program data 147 are given different numbers
here to illustrate that, at a minimum, they are
different copies.

A user may enter commands and information
into the computer 110 through input devices such as a
keyboard 162, a microphone 163, and a pointing device
161, such as a mouse, trackball or touch pad. Other
input devices (not shown) may include a joystick,
game pad, satellite dish, scanner, or the like.
These and other input devices are often connected to
the processing unit 120 through a user input
interface 160 that is coupled to the system bus, but
may be connected by other interface and bus
structures, such as a parallel port, game port or a
universal serial bus (USB). A monitor 191 or other
type of display device is also connected to the
system bus 121 via an interface, such as a video
interface 190. In addition to the monitor, computers
may also include other peripheral output devices such
as speakers 197 and printer 196, which may be
connected through an output peripheral interface 195.

The computer 110 may operate in a networked 
environment using logical connections to one or more 
remote computers, such as a remote computer 180. The 
remote computer 180 may be a personal computer, a
hand-held device, a server, a router, a network PC, a
peer device or other common network node, and 
typically includes many or all of the elements 
described above relative to the computer 110. The 
logical connections depicted in FIG. 1 include a
local area network (LAN) 171 and a wide area network
(WAN) 173, but may also include other networks. Such 
networking environments are commonplace in offices, 
enterprise-wide computer networks, intranets and the 
Internet.

When used in a LAN networking environment,
the computer 110 is connected to the LAN 171 through
a network interface or adapter 170. When used in a
WAN networking environment, the computer 110
typically includes a modem 172 or other means for
establishing communications over the WAN 173, such as
the Internet. The modem 172, which may be internal
or external, may be connected to the system bus 121
via the user-input interface 160, or other
appropriate mechanism. In a networked environment,
program modules depicted relative to the computer
110, or portions thereof, may be stored in the remote
memory storage device. By way of example, and not
limitation, FIG. 1 illustrates remote application
programs 185 as residing on remote computer 180. It
will be appreciated that the network connections
shown are exemplary and other means of establishing a
communications link between the computers may be
used.

FIG. 2 generally illustrates a speech
processing system 200 that receives an input 202 to
provide an output 204, derived, in part, from a phone
set described below. For example, the speech
processing system 200 can be embodied as a speech
recognizer that receives as an input spoken words or
phrases, such as through microphone 163, to provide an
output comprising text, for example, stored in any of
the computer readable media storage devices. In
another embodiment, the speech processing system 200
can be embodied as a text-to-speech converter that
receives text, embodied for example on a computer
readable medium, and provides as an output speech that
can be rendered to the user through speaker 197. It
should be understood that these components can be
provided in other systems, and as such are further
considered speech processing systems as used herein.

During processing, the speech processing system
200 accesses a module 206 derived from the phone set
discussed below in order to process the input 202 and
provide the output 204. The module 206 can take many
forms, for example a model, database, etc., such as an
acoustic model used in speech recognition or a unit
inventory used in concatenative text-to-speech
converters. The phone set forming the basis of module
206 is a segmental tonal model of a tonal language such
as, but not limited to, Chinese (Mandarin is
described below by way of example), Vietnamese, and
Thai, including dialects thereof.

An exemplary text-to-speech converter 300 for
converting text to speech is illustrated in FIG. 3.
Generally, the converter 300 includes a text analyzer
302 and a unit concatenation module 304. Text to be
converted into synthetic speech is provided as an input
306 to the text analyzer 302. The text analyzer 302
performs text normalization, which can include
expanding abbreviations to their formal forms as well
as expanding numbers, monetary amounts, punctuation and
other non-alphabetic characters into their full word
equivalents. The text analyzer 302 then converts the
normalized text input to a string of sub-word elements,
such as phonemes, by known techniques. The string of
phonemes is then provided to the unit concatenation
module 304. If desired, the text analyzer 302 can
assign accentual parameters to the string of phonemes
using prosodic templates, not shown.

The unit concatenation module 304 receives the
phoneme string and constructs synthetic speech input,
which is provided as an output signal 308 to a
digital-to-analog converter 310, which in turn provides an
analog signal 312 to the speaker 197. Based on the
string input from the text analyzer 302, the unit
concatenation module 304 selects representative
instances from a unit inventory 316 after working
through corresponding decision trees stored at 318.
The unit inventory 316 is a store of context-dependent
units of actual acoustic data, such as in decision
trees. In one embodiment, triphones (a phoneme with
its one immediately preceding and succeeding phonemes
as the context) are used for the context-dependent
units. Other forms of units include quinphones and
diphones. The decision trees 318 are accessed to
determine which unit is to be used by the unit
concatenation module 304. In one embodiment, the unit
is one phone for each of the phones of the phone set
discussed below.
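
The notion of a triphone as a phone together with its
immediately preceding and succeeding phones can be sketched in
Python as follows. The "left-phone+right" label format and the
"sil" boundary symbol are assumptions chosen for illustration and
are not specified in this description.

```python
def to_triphones(phones, boundary="sil"):
    """Return one context-dependent label per phone in the input list."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


print(to_triphones(["m", "a"]))  # ['sil-m+a', 'm-a+sil']
```

Diphones or quinphones would follow the same pattern with one
neighbor on one side or two neighbors on each side.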

The phone decision tree 318 is a binary tree that
is grown by splitting a root node and each of a
succession of nodes with a linguistic question
associated with each node, each question asking about
the category of the left (preceding) or right
(following) phone. The linguistic questions about a
phone's left or right context are usually generated by
an expert in linguistics in a design to capture
linguistic classes of contextual effects based on the
phone set discussed below. In one embodiment, Hidden
Markov Models (HMM) are created for each unique
context-dependent phone-based unit. Clustering is
commonly used in order to provide a system that can run
efficiently on a computer given its capabilities.

As stated above, the unit concatenation module 304
selects the representative instance from the unit
inventory 316 after working through the decision trees
318. During run time, the unit concatenation module
304 can either concatenate the best preselected
phone-based unit or dynamically select the best
phone-based unit available from a plurality of instances
that minimizes a joint distortion function. In one
embodiment, the joint distortion function is a
combination of HMM score, phone-based unit
concatenation distortion and prosody mismatch
distortion. The system 300 can be embodied in the
computer 110 wherein the text analyzer 302 and the unit
concatenation module 304 are hardware or software
modules, and where the unit inventory 316 and the
decision trees 318 can be stored using any of the
storage devices described with respect to computer 110.
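
The dynamic selection step can be illustrated with a short
Python sketch. The description says only that the joint
distortion is "a combination" of the three terms; the weighted sum
below, the class and function names, and the cost values are all
assumptions made for this example. The sketch also scores each
candidate in isolation, whereas a concatenation distortion would
normally depend on the neighboring units chosen.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    hmm_cost: float      # e.g. negative log HMM score (lower is better)
    concat_cost: float   # concatenation distortion
    prosody_cost: float  # prosody mismatch distortion


def joint_distortion(c, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three cost terms (an assumed combination)."""
    w_hmm, w_cat, w_pro = weights
    return (w_hmm * c.hmm_cost + w_cat * c.concat_cost
            + w_pro * c.prosody_cost)


def select_unit(candidates, weights=(1.0, 1.0, 1.0)):
    """Pick the instance with the smallest joint distortion."""
    return min(candidates, key=lambda c: joint_distortion(c, weights))


instances = [Candidate("inst_a", 2.0, 0.5, 0.3),
             Candidate("inst_b", 1.5, 0.4, 1.5)]
print(select_unit(instances).name)                   # inst_a
print(select_unit(instances, (1.0, 1.0, 0.0)).name)  # inst_b
```

Changing the weights changes which instance wins, which is why
the choice of combination matters in such a design.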

As appreciated by those skilled in the art, other
forms of text-to-speech converters can be used.
Besides the concatenative synthesizer 304 described
above, articulatory synthesizers and formant
synthesizers can also be used to provide text-to-speech
conversion.

In a further embodiment, the speech processing
system 200 can comprise a speech recognition module or
speech recognition system, an exemplary embodiment of
which is illustrated in FIG. 4 at 400. The speech
recognition system 400 receives input speech from the
user at 402 and converts the input speech to the text
404. The speech recognition system 400 includes the
microphone 163, an analog-to-digital (A/D) converter
403, a training module 405, a feature extraction module
406, a lexicon storage module 410, an acoustic model
412, a search engine 414, and a language model 415.
It should be noted that the entire system 400, or
part of speech recognition system 400, can be
implemented in the environment illustrated in FIG. 1.
For example, microphone 163 can preferably be
provided as an input device to the computer 110,
through an appropriate interface, and through the A/D
converter 403. The training module 405 and feature
extraction module 406 can be either hardware modules
in the computer 110, or software modules stored in
any of the information storage devices disclosed in
FIG. 1 and accessible by the processing unit 120 or
another suitable processor. In addition, the lexicon
storage module 410, the acoustic model 412, and the
language model 415 are also preferably stored in any
of the memory devices shown in FIG. 1. Furthermore,
the search engine 414 is implemented in processing
unit 120 (which can include one or more processors)
or can be performed by a dedicated speech recognition
processor employed by the personal computer 110.

In the embodiment illustrated, during speech
recognition, speech is provided as an input into the
system 400 in the form of an audible voice signal by
the user to the microphone 163. The microphone 163
converts the audible speech signal into an analog
electronic signal, which is provided to the A/D
converter 403. The A/D converter 403 converts the
analog speech signal into a sequence of digital
signals, which is provided to the feature extraction
module 406. In one embodiment, the feature extraction
module 406 is a conventional array processor that
performs spectral analysis on the digital signals and
computes a magnitude value for each frequency band of a
frequency spectrum. The signals are, in one
illustrative embodiment, provided to the feature
extraction module 406 by the A/D converter 403 at a
sample rate of approximately 16 kHz, although other
sample rates can be used.

The feature extraction module 406 divides the
digital signal received from the A/D converter 403 into
frames that include a plurality of digital samples.
Each frame is approximately 10 milliseconds in
duration. The frames are then encoded by the feature
extraction module 406 into a feature vector reflecting
the spectral characteristics for a plurality of
frequency bands. In the case of discrete and
semi-continuous Hidden Markov Modeling, the feature
extraction module 406 also encodes the feature vectors
into one or more code words using vector quantization
techniques and a codebook derived from training data.
Thus, the feature extraction module 406 provides, at
its output, the feature vectors (or code words) for each
spoken utterance. The feature extraction module 406
provides the feature vectors (or code words) at a rate
of one feature vector (or code word) approximately
every 10 milliseconds.
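
The framing and spectral analysis steps can be sketched in
Python under simplifying assumptions: a 16 kHz sample rate,
non-overlapping 10 millisecond frames (160 samples each), and a
naive discrete Fourier transform in place of whatever spectral
analysis the array processor performs. The function names and the
1 kHz test tone are illustrative and do not come from this
description.

```python
import cmath
import math

SAMPLE_RATE_HZ = 16000
FRAME_MS = 10


def split_frames(samples, sample_rate=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Cut the samples into non-overlapping fixed-length frames."""
    size = sample_rate * frame_ms // 1000  # 160 samples at 16 kHz
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, size)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for frequency bins 0..N/2 of one frame."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


# One second of a 1 kHz sine tone sampled at 16 kHz.
signal = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE_HZ)
          for t in range(SAMPLE_RATE_HZ)]
frames = split_frames(signal)
spectrum = magnitude_spectrum(frames[0])
print(len(frames), len(frames[0]), spectrum.index(max(spectrum)))
# 100 160 10
```

With 160-sample frames each bin spans 100 Hz, so the 1 kHz tone
peaks in bin 10, and one second of audio yields 100 frames, i.e.
one feature vector roughly every 10 milliseconds.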

Output probability distributions are then computed
against Hidden Markov Models using the feature vector
(or code words) of the particular frame being analyzed.
These probability distributions are later used in
executing a Viterbi or similar type of processing
technique.
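
For readers unfamiliar with the Viterbi technique, the following
Python sketch decodes the most likely state sequence of a small
Hidden Markov Model from per-frame output log probabilities. The
two-state left-to-right model and all probability values are made
up for illustration; the sketch is a generic textbook formulation
and is not presented as the decoder used by the search engine 414.

```python
import math

NEG_INF = float("-inf")


def viterbi(obs_logprob, log_trans, log_init):
    """obs_logprob[t][s] is the log output probability of frame t in
    state s. Returns the most likely state sequence as a list."""
    n_states = len(log_init)
    score = [log_init[s] + obs_logprob[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(obs_logprob)):
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda p: prev[p] + log_trans[p][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s]
                         + obs_logprob[t][s])
        backptr.append(ptr)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):  # follow back-pointers to frame 0
        state = ptr[state]
        path.append(state)
    return path[::-1]


log = math.log
init = [0.0, NEG_INF]                           # always start in state 0
trans = [[log(0.5), log(0.5)], [NEG_INF, 0.0]]  # left-to-right topology
obs = [[log(0.9), log(0.1)],
       [log(0.2), log(0.8)],
       [log(0.1), log(0.9)]]
print(viterbi(obs, trans, init))  # [0, 1, 1]
```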

Upon receiving the code words from the feature
extraction module 406, the search engine 414 accesses
information stored in the acoustic model 412. The model
412 stores acoustic models, such as Hidden Markov
Models, which represent speech units to be detected by
the speech recognition system 400. In one embodiment,
the acoustic model 412 includes a senone tree
associated with each Markov state in a Hidden Markov
Model. The Hidden Markov Models represent the phone
set discussed below. Based upon the senones in the
acoustic model 412, the search engine 414 determines
the most likely phones represented by the feature
vectors (or code words) received from the feature
extraction module 406, and hence representative of the
utterance received from the user of the system.

The search engine 414 also accesses the lexicon
stored in module 410. The information received by the
search engine 414 based on its accessing of the
acoustic model 412 is used in searching the lexicon
storage module 410 to determine a word that most likely
represents the code words or feature vector received
from the feature extraction module 406. Also, the
search engine 414 accesses the language model 415,
which can take many different forms such as those
employing N-grams, context-free grammars or
combinations thereof. The language model 415 is also
used in identifying the most likely word represented by
the input speech. The most likely word is provided as
output text 404.

As appreciated by those skilled in the art, other
forms of speech recognition systems can be used. Besides
the Hidden Markov Modeling recognizer described above,
recognizers based on Artificial Neural Networks (ANN) or
Dynamic Time Warping (DTW), or combinations of them
such as a hybrid ANN-HMM system, can also benefit from
modules derived from the phone set described below.

As discussed in the Background section
above, a base syllable in Chinese can be represented
with the following form:

(C) + (G) V (V, N)

where the first part before "+" is called the initial,
which mainly consists of consonants, and the parts
after "+" are called finals, and where (G), V and (V,
N) are called the head (glide), body (main vowel) and
tail (coda) of the final, respectively, and the units in
brackets are optional in constructing valid
syllables.

At this point it should be noted that the form
provided above is used herein for purposes of
explaining aspects of the present invention; however,
this form should not be considered required or
limiting. In other words, it should be understood that
different forms may be used as alternative structures
for describing syllables in Chinese and other tonal
languages, and that specific details, beyond those
discussed below, regarding such forms are essentially
independent of the phone set described herein.

In general, a new phone set, herein called
segmental tonal modeling, comprises three parts for
each syllable of the form:

CG V1 V2

where CG corresponds to (C) (G) in the form mentioned
above, but includes the glide, thereby yielding a
glide-dependent initial. However, use of the word
"initial" here should not be confused with "initial" as
used above, since the glide, which was considered part
of the final, is now associated with this first
part. Assigning the glide to the initial or first
part extends the unit inventory from that of the
first form.

With respect to Mandarin Chinese, there are only
three valid glides, /u/, /ü/ (to simplify the
labeling, /v/ is used to represent /ü/) and /i/, so
each initial consonant is classified into four
categories at most. In fact, most of them have only two
or three categories since some consonant-glide
combinations are invalid in Mandarin. For example,
for consonant /t/, there exist /t/, /ti/ and /tu/,
while for consonant /j/, there exist only /ji/ and
/jv/.
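
The enumeration of glide-dependent initials can be sketched in
Python as follows. Only the two consonants used as examples above
are listed; the table and function names are hypothetical, and a
full table of valid consonant-glide combinations would have to be
supplied from Mandarin phonology.

```python
# "" means no glide; /v/ stands for the glide written with u-umlaut.
GLIDES = ["", "i", "u", "v"]

# Valid glide categories for the two consonants discussed in the text.
VALID_GLIDES = {"t": ["", "i", "u"], "j": ["i", "v"]}


def glide_dependent_initials(consonant):
    """List the CG units that exist for one initial consonant."""
    return [consonant + g for g in GLIDES if g in VALID_GLIDES[consonant]]


print(glide_dependent_initials("t"))  # ['t', 'ti', 'tu']
print(glide_dependent_initials("j"))  # ['ji', 'jv']
```

Because there are three glides plus the no-glide case, no
consonant can contribute more than four CG units.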

V1 and V2, of the present inventive form,
collectively provide the remaining syllable
information (referred to as the main final in this
invention), including the tonal information. V1 can be
considered as representing a first portion of the main
final information. In some syllables it represents
the first vowel, if the main final contains two
phonemes; in other syllables it represents the first
portion of the phoneme, if the main final has only one
phoneme. V1 also carries or includes a first portion of
the tonal information. V2 can be considered as
representing a second portion of the main final
information. In some syllables it represents
the second phoneme, when the main final contains two
phonemes; in other syllables it represents the second
portion of the phoneme, when the main final has only
one phoneme. V2 also carries or includes a second
portion of the tonal information. In other words,
instead of modeling tone types directly, tones are
realized implicitly and jointly by a plurality of
parts, e.g. two parts or segments (herein also called
"segmental tonemes"), which both carry tonal
information.

Associated with each of V1 and V2 is tonal
information. As is known, Mandarin Chinese has five
different tones, four of which are illustrated in
FIG. 5. The fifth tone is a neutralization mode of
the other four. In the embodiment described herein,
the different tone types are described by
combinations of three categorical pitch levels
according to their relative pitch region, herein
illustrated as high (H), medium (M) and low (L).
That is, the tone types illustrated in FIG. 5 can be
classified in categorical levels as high-high (HH)
for tone 1, low-high (LH) or medium-high (MH) for
tone 2, low-low (LL) for tone 3 and high-low (HL) for
tone 4. Tone 5, the neutral tone, can either share
the pattern of tone 4 or tone 3 according to the
previous tone types, or be modeled separately as
medium-medium (MM). The first mark in the tone
pattern is attached to V1 and the second mark is
attached to V2.
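
As a rough illustration of the tone patterns just described, the following Python sketch attaches the first mark of a pattern to V1 and the second mark to V2. The names TONE_PATTERNS and tag_segments are hypothetical and are not part of the described system; tone 5 is shown only in its separately modeled MM variant, and the base labels "aa" and "ng" are used purely as sample inputs.

```python
# Two-mark pitch patterns per tone, as listed in the text above.
# Tone 2 has two admissible variants (LH or MH). Tone 5 is modeled
# separately as MM here; sharing tone 3/4 patterns is not shown.
TONE_PATTERNS = {
    1: ["HH"],
    2: ["LH", "MH"],
    3: ["LL"],
    4: ["HL"],
    5: ["MM"],
}


def tag_segments(v1, v2, tone):
    """Return every (V1, V2) labeling allowed for the given tone.

    The first mark of each pattern is appended to v1 and the second
    mark to v2, so tonal information is carried by both segments.
    """
    return [(v1 + p[0], v2 + p[1]) for p in TONE_PATTERNS[tone]]


if __name__ == "__main__":
    for tone in sorted(TONE_PATTERNS):
        print(tone, tag_segments("aa", "ng", tone))
```

Keeping a list of patterns per tone, rather than a single string, is one way to leave the choice between LH and MH for tone 2 open until training time.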

At this point an example may be helpful. Table 1
below provides the decomposition of the tonal
syllable /zhuang$/, where $ = {1, 2, 3, 4, 5}
represents the five different tones for the syllable.

Table 1

Tonal syllable    CG       V1                V2
/zhuang1/         /zhu/    /aaH/             /ngH/
/zhuang2/         /zhu/    /aaL/ or /aaM/    /ngH/
/zhuang3/         /zhu/    /aaL/             /ngL/
/zhuang4/         /zhu/    /aaH/             /ngL/
/zhuang5/         /zhu/    /aaM/             /ngM/

In the present inventive form, {zhu} and {aaH, aaM,
aaL, ngH, ngM, ngL} become a part of the final phone
set. As mentioned above, instead of appending the 5
tones to the final part (/uang/), the glide /u/ is
assigned to the initial part /zh/, forming /zhu/. The
remaining part /ang$/ of the syllable is segmented
into two phonemes /a/+/ng/ and labeled as /aa/+/ng/
based on phonology. Tones 1-5 are then realized by
combinations of H/L/M, which are finally attached to
the corresponding phonemes (such as /aa/ and /ng/).

In some syllables, the final part contains only
one phoneme, such as /zha/. Nevertheless, the final
part is segmented into two parts (/aa/ for V1 and
/aa/ for V2) to achieve consistency in syllable
decomposition. Table 2 illustrates the decomposition
of /zha$/ using the present inventive form.



Table 2

Tonal syllable    CG      V1              V2
/zha1/            /zh/    /aH/            /aH/
/zha2/            /zh/    /aL/ or /aM/    /aH/
/zha3/            /zh/    /aL/            /aL/
/zha4/            /zh/    /aH/            /aL/
/zha5/            /zh/    /aM/            /aM/

Using the techniques described above, a phone
set with 97 units (plus /sil/ for silence) can be
realized, in which 57 are used to describe
glide-dependent initials and the remaining 39 are
used to describe final parts (V1 and V2). Table 3
provides the phone list comprising 97 units (plus
/sil/), where the left column is initial-related,
while the right column provides segmental tonemes
that correspond to the main final parts. It should be
noted that, to keep a consistent decomposition
structure for all valid syllables, several phone
units are explicitly created for syllables without
initial consonants, i.e. the so-called zero-initial
case, which are denoted as /ga/, /ge/ and /go/ in
Table 3. The second symbol in each of them is decided
by the first phoneme of the final part. E.g. the CG
for syllable /an1/ is /ga/ and the CG for syllable
/en1/ is /ge/. However, doing this is not necessary
if the speech processing system does not require the
same syllable structure all the time, and in some
realizations the three can be merged into one.
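
The count of 39 final-part units follows from combining each base phone of the main final with each of the three pitch levels. The Python sketch below enumerates them under that assumption; the base phone labels are taken from the right-hand column of Table 3, while the names BASE_FINAL_PHONES, PITCH_LEVELS and build_segmental_tonemes are hypothetical and chosen only for illustration.

```python
# Base phones of the main final, as they appear (with M/H/L tags)
# in the right-hand column of Table 3.
BASE_FINAL_PHONES = (
    "aa", "a", "eh", "el", "er", "ib", "if",
    "i", "ng", "nn", "o", "u", "v",
)

# The three categorical pitch levels used in this embodiment.
PITCH_LEVELS = ("M", "H", "L")


def build_segmental_tonemes(bases=BASE_FINAL_PHONES, levels=PITCH_LEVELS):
    """Cross every base phone with every pitch level."""
    return [base + level for base in bases for level in levels]


if __name__ == "__main__":
    tonemes = build_segmental_tonemes()
    print(len(tonemes))   # 13 base phones x 3 levels
    print(tonemes[:6])
```

Merging a pair of base phones, as suggested elsewhere in this description for /a/ and /aa/, would amount to removing one entry from the base list, which shrinks the inventory by three units.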



Table 3

Initials                    Segmental tonemes
b     bi    bu              aaM   aaH   aaL
c           cu              aM    aH    aL
ch          chu             ehM   ehH   ehL
d     di    du              elM   elH   elL
f           fu              erM   erH   erL
g           gu              ibM   ibH   ibL
ga/ge/go (zero-initials)    ifM   ifH   ifL
h           hu              iM    iH    iL
      ji          jv        ngM   ngH   ngL
k           ku              nnM   nnH   nnL
l     li    lu    lv        oM    oH    oL
m     mi    mu              uM    uH    uL
n     ni    nu    nv        vM    vH    vL
p     pi    pu              sil
      qi          qv
r           ru
s           su
sh          shu
t     ti    tu
            wu
      xi          xv
      yi          yv
z           zu
zh          zhu

A detailed phone list and the syllable-to-phone-set
mappings are indicated below. It should be noted that
the phoneme /a/ in /ang/ and in /an/ is represented
by different symbols (phones), /aa/ and /a/
respectively, in Table 4 because the places of
articulation of the two are slightly different. These
phones can be merged to form one unit if a smaller
phone set is desired or if a sufficient amount of
training data does not exist. Another pair that can
be merged is /el/ and /eh/.

The full list of mappings between syllables and
the phone inventory can be deduced from Table 3 and
Table 4. As mentioned in the background, there are
more than 420 base syllables and more than 1200
tonal syllables. To save space, instead of listing
all mapping pairs, only the mappings between the
standard finals (38) and the phones in the inventory
are listed in Table 4. The full list of mappings
between syllables and phones can be easily extracted
according to the decomposition method introduced
above and Table 4. For example, for syllable /tiao4/,
which consists of initial /t/, final /iao/ and tone
4, Table 4 indicates that /iao/ -> /i/+/aa/+/o/.
Based on the above decomposition strategy, the glide
/i/ is merged with the initial to form the
glide-dependent initial /ti/, while tone 4 is
decomposed into HL; therefore, the mapping of tonal
syllable /tiao4/ becomes /tiao4/ -> /ti/+/aaH/+/oL/.
In addition, V1 and V2 of the inventive form should
both carry tonal tags such as H, M and L, while V1
and V2 as shown in Table 4 are just the base forms of
the phonemes without tonal tags.
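
The decomposition procedure walked through for /tiao4/ can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: FINAL_TABLE holds just four finals taken from Table 4, tone 2 is fixed to its LH variant, tone 5 is modeled as MM, and the names FINAL_TABLE, TONE_PATTERN and decompose are hypothetical.

```python
# A small excerpt of Table 4: final -> (glide, V1 base, V2 base).
# An empty string means the final has no glide.
FINAL_TABLE = {
    "a": ("", "a", "a"),
    "ang": ("", "aa", "ng"),
    "iao": ("i", "aa", "o"),
    "uang": ("u", "aa", "ng"),
}

# One pitch pattern per tone (assumption: LH for tone 2, MM for tone 5).
TONE_PATTERN = {1: "HH", 2: "LH", 3: "LL", 4: "HL", 5: "MM"}


def decompose(initial, final, tone):
    """Split a tonal syllable into (CG, V1, V2).

    The glide is merged into the initial to form the glide-dependent
    initial (CG); the two marks of the tone pattern are attached to
    the two base phones of the main final.
    """
    glide, v1_base, v2_base = FINAL_TABLE[final]
    first_mark, second_mark = TONE_PATTERN[tone]
    return (initial + glide, v1_base + first_mark, v2_base + second_mark)


if __name__ == "__main__":
    print(decompose("t", "iao", 4))    # /tiao4/
    print(decompose("zh", "uang", 1))  # /zhuang1/
    print(decompose("zh", "a", 3))     # /zha3/
```

Every syllable comes out as exactly three units regardless of whether a glide is present, which is the fixed segment structure this form is meant to provide.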

Table 4

Decomposition table for all
standard Finals without tone

Finals                      Glide   V1    V2
a                           -       a     a
ai                          -       a     eh
an                          -       a     nn
ang                         -       aa    ng
ao                          -       aa    o
e                           -       el    el
ei                          -       eh    i
en                          -       el    nn
eng                         -       el    ng
er                          -       er
i                           i       i     i
ia                          i       a     a
ian                         i       a     nn
iang                        i       aa    ng
iao                         i       aa    o
ib (for the /i/ in /zhi/)   -       ib    ib
ie                          i       eh    eh
if (for the /i/ in /zi/)    -       if    if
in                          i       i     nn
ing                         i       el    ng
iong                        i       u     ng
iu                          i       o     u
o                           u       u     o
ong                         u       u     ng
ou                          -       o     u
u                           u       u     u
ua                          u       a     a
uai                         u       a     eh
uan                         u       a     nn
uang                        u       aa    ng
ui                          u       eh    i
un                          u       el    nn
uo                          u       o     o
v                           v       v     v
van                         v       eh    nn
ve                          v       eh    eh
vn                          v       el    nn

Use of the phone set construction as described
above can provide several significant advantages.
The phone set for a tonal language such as Chinese is
reduced, while maintaining the distinctions necessary
for accuracy in both speech recognition and
text-to-speech conversion. In addition, the syllable
construction is consistent with the findings and
descriptions of phonologists on tones such as those
found in Chinese. Syllables created using the
construction above are also consistent, regardless of
the presence of optional parts. In addition,
syllables embodied as three parts (an initial and a
two-part final) are more suitable for the
state-of-the-art search framework and therefore
yield more efficiency than the normal 2-part
decomposition of syllables during fan-out extensions
in speech recognition. Furthermore, each tonal
syllable has a fixed segment structure (e.g. three
segments), which can potentially be applied to
decoding as a constraint to improve search
efficiency. Finally, detailed modeling of initials by
building glide-dependent initials can aid in
distinguishing the initials from each other.

Although the present invention has been
described with reference to particular embodiments,
workers skilled in the art will recognize that
changes may be made in form and detail without
departing from the spirit and scope of the invention.
For example, under the basic idea of representing the
typical tone types with segmental tonemes, this
concept can easily extend the current 2-value
(High/Low) quantization of pitch level into more
detailed levels, such as 3-value (High/Middle/Low)
or even 5-value (1-5), to depict the pattern of the
typical tone types in detail, if desired. If five
values are used for Mandarin Chinese tones, the
following representation could be used: 5-5 for tone
1, 3-5 or 2-5 for tone 2, 2-1 for tone 3 and 5-1, 5-2
or 4-1 for tone 4. However, this should be more
meaningful for tonal languages with more tone types,
such as Cantonese, which has about nine tone types.
Cantonese is a very important dialect of Chinese
commonly used in Hong Kong, in southern China, by
overseas Chinese, etc.
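
To show how the five-value representation relates to the coarser categorical levels, the Python sketch below collapses each five-value pattern into H/M/L marks. The five-value patterns are those listed above; the thresholds used in coarsen (1-2 as L, 3 as M, 4-5 as H) are an assumption made only for this illustration, and the names FIVE_LEVEL_PATTERNS, coarsen and to_three_level are hypothetical.

```python
# Five-value pitch patterns for Mandarin tones 1-4, as listed above.
FIVE_LEVEL_PATTERNS = {
    1: [(5, 5)],
    2: [(3, 5), (2, 5)],
    3: [(2, 1)],
    4: [(5, 1), (5, 2), (4, 1)],
}


def coarsen(level):
    """Map a 1-5 pitch value to L, M or H (assumed thresholds)."""
    if level <= 2:
        return "L"
    if level == 3:
        return "M"
    return "H"


def to_three_level(pattern):
    """Collapse a five-value (start, end) pattern into two H/M/L marks."""
    return "".join(coarsen(value) for value in pattern)


if __name__ == "__main__":
    for tone, patterns in sorted(FIVE_LEVEL_PATTERNS.items()):
        print(tone, sorted({to_three_level(p) for p in patterns}))
```

Under these assumed thresholds, every five-value variant of a tone falls back onto one of the three-level patterns given earlier for that tone.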



